Introduction to Open Data Science - Course Project

Course diary entry #1

I’ve been using R and data analysis tools for a while already, but like so many digital humanists, I haven’t studied these specific tools formally. This course seems like a good place to learn about the tools and methods in a more structured way, and the course programme looks very promising.

Reflections on Exercise Set 1

I’m familiar with the basic structures of R, so this was pretty smooth sailing. I’ve encountered some of the tidyverse stuff before, but haven’t really studied it. This introduction to it was quite enlightening, and I now have a significantly better understanding of the tidyverse structures. I can see the merits of the pipe (%>%) operator: it makes following the workflow steps very easy.
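The pipe’s appeal is easiest to see in a tiny example (my own sketch, using the built-in mtcars data rather than anything from the exercises): each step’s result is passed as the first argument of the next call, so the code reads top to bottom.

```r
library(dplyr)

# Mean fuel consumption of six-cylinder cars, one step per line:
mtcars %>%
  filter(cyl == 6) %>%
  summarise(mean_mpg = mean(mpg))
```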

tibble() was a new structure to me as well, and it is going to be useful. The promise of stricter adherence to input variable types and less unhelpful “help” when compared to data frames sounds like a way to avoid a lot of headaches.
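One concrete difference I ran into while reading up on this (a sketch from my own testing, not from the exercise set): a data.frame silently partial-matches column names with `$`, while a tibble refuses and warns.

```r
library(tibble)

df <- data.frame(value = 1:3)
df$val  # partial matching: silently returns the 'value' column

tb <- tibble(value = 1:3)
tb$val  # returns NULL, with a warning about an unknown column
```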

The R for Health Data Science book was very easy to follow, and is well structured. I’m pretty sure I’m going to use it as an R reference in the future; it is a lot clearer than much of the official R documentation.

Personal project GitHub repo

My GitHub repo for the course is here: https://github.com/villevaara/IODS-project


Week 2 exercises

Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt . (The separator is a comma “,” and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)

library(readr)
learning2014 <- read_csv("data/learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014)
## [1] 166   7
# the data has 166 rows, 7 columns of variables for each row.
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# ... the variables are as listed by str() above: gender is of chr type, the rest numeric.

The data comes from a survey on study methods conducted in 2014. Its structure is summarised above: each row is one respondent, and the attitude and learning-approach variables (deep, stra, surf) are mean scores computed over a wider set of survey questions.

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'

qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

The function produces a really nice overview with little effort. I wonder if it is as easy with less well-behaved data. Some variables seem to correlate strongly with points, especially attitude. There’s also something curious going on between the surface and deep learning approaches: males show a strong inverse correlation there, while females do not. Female respondents are over-represented in the sample, though. Maybe male respondents tend to employ either deep or surface approaches, while female respondents might employ both.
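To check that surface/deep pattern numerically (my own addition, assuming the learning2014 data loaded above), the correlation can be computed per gender:

```r
library(dplyr)

# Correlation between surface and deep approach scores, by gender.
learning2014 %>%
  group_by(gender) %>%
  summarise(cor_surf_deep = cor(surf, deep))
```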

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)

# fit a linear model
p_surf_model3 <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(p_surf_model3)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
p_model1 <- lm(points ~ attitude, data = learning2014)
summary(p_model1)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Neither ‘stra’ nor ‘surf’ seems to be statistically significant, at least if I got this right. The results look roughly the same without those two as with them. ‘attitude’ is the sole significant explanatory variable for ‘points’.
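One way to double-check that dropping ‘stra’ and ‘surf’ loses nothing (a sketch, not required by the exercise) is a nested-model F-test on the two models fitted above; a large p-value would suggest the extra predictors don’t improve the fit.

```r
# Compare the one-predictor model to the three-predictor model.
anova(p_model1, p_surf_model3)
```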

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)

p_att_model <- lm(points ~ attitude, data = learning2014)
summary(p_att_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The model based on attitude explains ~19% of the variability in points; attitude is the strongest, and nearly the only, variable with a significant effect on points. Adding stra and surf did raise multiple R-squared a hair, but with all three in the model the adjusted R-squared dropped a bit. My understanding is that the extra variables are then not helpful: they needlessly complicate the model.
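As a sanity check on my reading of R-squared (my own sketch, using the objects defined above): with a single predictor, multiple R-squared is just the squared correlation between the predictor and the outcome.

```r
# Both of these should be ~0.19, matching the summary above.
r2_manual <- cor(learning2014$attitude, learning2014$points)^2
r2_model <- summary(p_att_model)$r.squared
all.equal(r2_manual, r2_model)
```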

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)

plot(p_att_model, which = c(1,2,5))

The material states that linear regression modelling has four main assumptions:

  1. Linear relationship between predictors and outcome;
  2. Independence of residuals;
  3. Normal distribution of residuals;
  4. Equal variance of residuals.

Residuals vs fitted - The residuals seem evenly scattered around zero with no clear pattern, as they should be.

Normal QQ-plot - The residuals diverge from the straight line a bit at both ends. That might be alarming? But is it enough to matter? I’m not sure I’m yet qualified to answer.

Residuals vs leverage - All the observations fall within Cook’s distance, so there are no especially influential observations skewing the results.
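A numeric complement to eyeballing the QQ-plot (my own addition, not part of the exercise) would be a Shapiro-Wilk test on the residuals; a small p-value would point to non-normal residuals.

```r
# Test residual normality of the attitude-only model fitted above.
shapiro.test(residuals(p_att_model))
```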


Week 3 exercises

Data

library(readr)
alc <- read_csv("data/alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(alc)
## [1] 370  35
str(alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school    : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex       : chr [1:370] "F" "F" "F" "F" ...
##  $ age       : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr [1:370] "U" "U" "U" "U" ...
##  $ famsize   : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr [1:370] "A" "T" "T" "T" ...
##  $ Medu      : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr [1:370] "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr [1:370] "teacher" "other" "other" "services" ...
##  $ reason    : chr [1:370] "course" "course" "other" "home" ...
##  $ guardian  : chr [1:370] "mother" "father" "mother" "mother" ...
##  $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
##  $ famsup    : chr [1:370] "no" "yes" "no" "yes" ...
##  $ activities: chr [1:370] "no" "no" "no" "yes" ...
##  $ nursery   : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet  : chr [1:370] "no" "yes" "yes" "yes" ...
##  $ romantic  : chr [1:370] "no" "no" "no" "yes" ...
##  $ famrel    : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr [1:370] "no" "no" "yes" "yes" ...
##  $ absences  : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   paid = col_character(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   alc_use = col_double(),
##   ..   high_use = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
# ... the variables are as listed above: 17 of chr type, 17 numeric, and one logical (high_use).

The ‘alc’ dataset contains questionnaire data on students in two Portuguese schools: student background information, along with details on their school performance and alcohol consumption.

The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)

Hypotheses

  1. High alcohol usage probably correlates with number of absences.
  2. Volume of alcohol usage is likely strongly gendered.
  3. High alcohol usage probably weakly inversely correlates with ‘G3’ (final year grade).
  4. High alcohol usage likely correlates with going out with friends (‘goout’).

Data exploration

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
alc_subset <- select(alc, c('alc_use', 'high_use', 'sex', 'goout', 'G3', 'absences'))

gather(alc_subset) %>% glimpse()
## Rows: 2,220
## Columns: 2
## $ key   <chr> "alc_use", "alc_use", "alc_use", "alc_use", "alc_use", "alc_use"…
## $ value <chr> "1", "1", "2.5", "1", "1.5", "1.5", "1", "1", "1", "1", "1.5", "…
gather(alc_subset) %>% View()

gather(alc_subset) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Absences and G3 scores pile up at both ends of the scale. Absences also has a strange peak in the middle, which might be an artifact of school practices: maybe a certain number of absences automatically disqualifies one from the class, or is the maximum permitted amount. Alcohol use is weighted towards the lower end of the scale, out-goingness is roughly normally distributed, and the male:female ratio is close to 1:1.

library(dplyr)
alc_subset %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      154
## 2 F     TRUE        41
## 3 M     FALSE      105
## 4 M     TRUE        70

Males are somewhat over-represented in the high alcohol usage group, as hypothesised.

alc_subset %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <chr> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       11.4
## 2 F     TRUE        41       11.8
## 3 M     FALSE      105       12.3
## 4 M     TRUE        70       10.3

High alcohol use seems to be associated with somewhat lower grades in males, but not in females.

# Some of the exercise code fit with my hypothesis nicely...
g1 <- ggplot(alc_subset, aes(x = high_use, y = G3))
g1 + geom_boxplot(aes(col = sex)) + ylab("grade")

And the same can be seen from the above plot.

alc_subset %>% group_by(high_use) %>% summarise(mean_goout = mean(goout))
## # A tibble: 2 × 2
##   high_use mean_goout
##   <lgl>         <dbl>
## 1 FALSE          2.85
## 2 TRUE           3.73

Going out seems to quite clearly correlate with high alcohol usage, as was the hypothesis.

alc_subset %>% group_by(high_use) %>% summarise(mean_absences = mean(absences))
## # A tibble: 2 × 2
##   high_use mean_absences
##   <lgl>            <dbl>
## 1 FALSE             3.71
## 2 TRUE              6.38

And alcohol use and absences correlate.
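To put a number on that (my own addition, not part of the exercise): a two-sample t-test on absences grouped by high_use.

```r
# Do mean absences differ between the high and low alcohol use groups?
t.test(absences ~ high_use, data = alc_subset)
```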

g2 <- ggplot(alc_subset, aes(x = high_use, y = absences))
g2 + geom_boxplot(aes(col = sex)) + ylab("absences") +
  ggtitle("Student absences by alcohol consumption and sex")

But gender doesn’t seem to make a huge difference here. Still, alcohol use in males does correlate with absences more clearly.

Logistic regression models

m <- glm(high_use ~ absences + goout + sex + G3, data = alc_subset, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + sex + G3, family = "binomial", 
##     data = alc_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9500  -0.7990  -0.5263   0.8110   2.4703  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.58037    0.70357  -5.089 3.60e-07 ***
## absences     0.08263    0.02267   3.645 0.000268 ***
## goout        0.70551    0.12167   5.799 6.68e-09 ***
## sexM         1.01859    0.25986   3.920 8.87e-05 ***
## G3          -0.04508    0.03985  -1.131 0.258029    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 372.82  on 365  degrees of freedom
## AIC: 382.82
## 
## Number of Fisher Scoring iterations: 4

absences, goout, and gender are all strongly associated with high_use, as the low p-values show, but G3 clearly is not. So, let’s drop G3 from the model:

m <- glm(high_use ~ absences + goout + sex, data = alc_subset, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + sex, family = "binomial", 
##     data = alc_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8060  -0.8090  -0.5248   0.8214   2.4806  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.18142    0.48085  -8.696  < 2e-16 ***
## absences     0.08478    0.02266   3.741 0.000183 ***
## goout        0.72793    0.12057   6.038 1.56e-09 ***
## sexM         1.02223    0.25946   3.940 8.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 374.10  on 366  degrees of freedom
## AIC: 382.1
## 
## Number of Fisher Scoring iterations: 4

Then exponentiate the coefficients to get odds ratios, with their confidence intervals:

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR       2.5 %     97.5 %
## (Intercept) 0.0152768 0.005706586 0.03774415
## absences    1.0884786 1.042536998 1.14083818
## goout       2.0707952 1.644519878 2.64113883
## sexM        2.7793948 1.682230346 4.66256127

The odds ratios are interpreted per unit change in each explanatory variable: each additional absence multiplies the odds of high alcohol use by about 1.09, each step up in ‘goout’ multiplies them by about 2.07, and being male multiplies them by about 2.78. The per-unit effect of absences looks small compared to the earlier summaries, but the absences scale is far more granular than the others, so its cumulative effect over the full range is still considerable. None of the confidence intervals include 1, which fits with all three predictors being significant.
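To make the granularity point concrete (a sketch using the model fitted above): the odds ratio compounds per unit, so a difference of ten absences multiplies the odds considerably even though one absence barely moves them.

```r
# Odds multiplier for ten additional absences: exp(coef)^10,
# roughly 2.3 with the fit above.
or_absences <- exp(coef(m)["absences"])
or_absences^10
```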

Cross tabulation of predictions

probabilities <- predict(m, type = "response")
alc_subset <- mutate(alc_subset, probability = probabilities)
alc_subset <- mutate(alc_subset, prediction = probability > 0.5)
raw_num <- table(high_use = alc_subset$high_use, prediction = alc_subset$prediction)
raw_num
##         prediction
## high_use FALSE TRUE
##    FALSE   242   17
##    TRUE     61   50
table(high_use = alc_subset$high_use, prediction = alc_subset$prediction) %>%
  prop.table() %>%
  addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65405405 0.04594595 0.70000000
##    TRUE  0.16486486 0.13513514 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000
precision <- raw_num[2,2]/sum(raw_num[,2])
precision
## [1] 0.7462687
recall <- raw_num[2,2]/sum(raw_num[2,])
recall
## [1] 0.4504505

The model doesn’t seem that great. Precision was ~75%, that is, only 3/4 of the positive predictions were accurate. Recall was ~45%, so more than half of the actual high_use cases went undetected.

1 - (sum(alc_subset$high_use == alc_subset$prediction) / nrow(alc_subset))
## [1] 0.2108108

The total proportion of inaccurate classifications is ~21%. That’s not too great a result: it seems these variables do not fully explain alcohol consumption. Still, a surprisingly high number of predictions are correct. The relatively good-looking number is probably because the model errs on the side of predicting not-high use, and as that is the more common class, the share of accurate predictions is misleadingly high. Precision and recall (which were not great) are probably better indicators.
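The misclassification rate above can be wrapped into a reusable loss function (a sketch along the lines of the course material, assuming the probability column added earlier):

```r
# Mean of wrong predictions at a 0.5 threshold;
# should reproduce the ~0.21 computed above.
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}
loss_func(class = alc_subset$high_use, prob = alc_subset$probability)
```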

If we try predicting with the three variables we decided were significant, by predicting high alcohol consumption if at least two of the three are on the side of the indicator favoring high alcohol consumption, the results look like this:

alc_subset$goout_guess <- alc_subset$goout > mean(alc_subset$goout)
alc_subset$abs_guess <- alc_subset$absences > mean(alc_subset$absences)
alc_subset$gender_guess <- alc_subset$sex == "M"
alc_subset$ultimate_guess <- (alc_subset$abs_guess + alc_subset$goout_guess + alc_subset$gender_guess) >= 2
1 - (sum(alc_subset$high_use == alc_subset$ultimate_guess) / nrow(alc_subset))
## [1] 0.2810811

We got ~28% wrong, which is a bit higher than the model’s ~21%, but not by much. I wouldn’t base a students’ alcoholism intervention programme on either.


Week 4 exercises

Didn’t have time to finish these this week! Less work for the reviewer I guess. I did a few of the starting points, and will finalize these later.

The Boston dataset

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The dataset contains various socio-economic variables on Boston suburbs, ranging from crime rates to property values and pupil/teacher ratios in schools.

Dataset summary

library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(GGally)

cor_matrix <- cor(Boston) %>% round(2) 
cor_matrix 
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
ggpairs(Boston)

corrplot(cor_matrix, method="circle", type="upper")

Boston %>% ggplot(aes(indus)) +
  geom_histogram(binwidth = 1) + 
  ylab("Industry ratio")

Boston %>% ggplot(aes(rad)) +
  geom_histogram(binwidth = 1) + 
  ylab("Radial highway access")

Boston %>% ggplot(aes(tax)) +
  geom_histogram(binwidth = 20) +
  ylab("Property tax rate")

Boston %>% ggplot(aes(age)) +
  geom_histogram(binwidth = 10) +
  ylab("Building age")

A few variables are strongly stratified: some suburbs have very high values, while most others have very low ones. See for example the crime rate (crim), river proximity (chas) and the proportion of Black residents (black).

Industry rate seems to correlate strongly (negatively or positively) with many of the other variables in the dataset. The histograms of the correlating variables vary quite widely in shape, so there is probably considerable variance between suburbs on these.

Standardized data, training and testing sets

library(dplyr)

boston_scaled <- scale(Boston) %>% as.data.frame()
boston_scaled$crim <- as.numeric(boston_scaled$crim)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

boston_scaled$crim <- crime

# Create test and train datasets
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
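A note to self (my own addition): sample() makes the split different on every run, so seeding the RNG first would make the train/test split reproducible.

```r
# Reproducible version of the split above (the seed value is arbitrary).
set.seed(2023)
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
```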

All the variables were standardized with the scale() function. The “crim” (crime rate) variable was then overwritten with a categorical version, cut at the quantiles of the standardized values.
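To convince myself of what scale() actually does (my own sketch): column-wise, it subtracts the mean and divides by the standard deviation.

```r
# Manual standardization of one column should equal scale()'s output.
manual <- (Boston$nox - mean(Boston$nox)) / sd(Boston$nox)
all.equal(manual, as.vector(scale(Boston$nox)))
```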

LDA

lda.fit <- lda(crim ~ ., data = train)
lda.fit
## Call:
## lda(crim ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2301980 0.2623762 0.2574257 0.2500000 
## 
## Group means:
##                   zn      indus          chas        nox          rm        age
## low       1.07662268 -0.8756570 -0.0606570120 -0.8759553  0.51487993 -0.8876639
## med_low  -0.09224468 -0.2624439 -0.0494743374 -0.5626278 -0.16723519 -0.3177981
## med_high -0.39571399  0.1720379  0.2198084618  0.3522210  0.06106323  0.4295536
## high     -0.48724019  1.0149946  0.0005392655  1.0633267 -0.39913112  0.8220915
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8300121 -0.7114257 -0.7590974 -0.47613319  0.38344206 -0.79915663
## med_low   0.3323186 -0.5539047 -0.4470287 -0.06399813  0.32074100 -0.09496593
## med_high -0.3664995 -0.3943861 -0.3045816 -0.27747853  0.08931443  0.04284730
## high     -0.8568122  1.6596029  1.5294129  0.80577843 -0.73675830  0.84681175
##                 medv
## low       0.61081524
## med_low  -0.03516024
## med_high  0.12659518
## high     -0.65486976
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.10284674  0.914141287 -0.91485721
## indus   -0.03198100 -0.120063354  0.05655135
## chas    -0.07007109 -0.067687771  0.02082257
## nox      0.32522997 -0.605498161 -1.36071190
## rm      -0.10737096 -0.032071546 -0.19245723
## age      0.32556594 -0.280111231 -0.22307242
## dis     -0.10508127 -0.315843723  0.21010456
## rad      3.31208603  0.941396333 -0.23390545
## tax     -0.10493507 -0.167488489  1.01614278
## ptratio  0.15609517  0.118335295 -0.27741488
## black   -0.16571792  0.001986836  0.12649416
## lstat    0.14463703 -0.232461112  0.29529300
## medv     0.15586539 -0.265118810 -0.23143007
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9480 0.0367 0.0153
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}


classes <- as.numeric(train$crim)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

LDA predictions

test_crime <- test$crim

test_lda <- dplyr::select(test, -crim)

lda.pred <- predict(lda.fit, newdata = test_lda)
table(correct = test$crim, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      17        1    0
##   med_low    2      13        5    0
##   med_high   0       4       18    0
##   high       0       0        1   25

Looks like the ‘low’ category got a mix of ‘low’ and ‘med_low’ predictions, while in the other categories the predictions mostly fit the test data. The sample size is quite small, and the actual values of the categorized variable might lie close to the dividing line in the low/med_low area. Or the explanatory variables might not cover all the actual drivers of high crime rates; it’s very unlikely that they would.
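The overall accuracy can be read off the confusion matrix, as the correct predictions sit on its diagonal. A quick sketch with the numbers copied from this particular run (they will vary with the random train/test split):

```r
# Confusion matrix as printed above; rows = correct, cols = predicted
conf <- matrix(c(16, 17,  1,  0,
                  2, 13,  5,  0,
                  0,  4, 18,  0,
                  0,  0,  1, 25),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # share of correct predictions, ~0.71
```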

K-means (unfinished)

boston_scaled <- scale(Boston) %>% as.data.frame()

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
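I didn’t get this done in time, but a minimal sketch of the steps the prompt describes (following the same conventions as the LDA part above) might look like this:

```r
library(MASS)  # Boston dataset

# Reload and standardize; distances are only comparable after scaling
boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(boston_scaled)  # Euclidean distances between observations
summary(dist_eu)

# Total within-cluster sum of squares for k = 1..10; a sharp drop
# ("elbow") in the curve suggests a good number of clusters
set.seed(123)
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "clusters", ylab = "total WCSS")

# Re-run with the chosen k and visualize a few of the variables
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[, 1:5], col = km$cluster)
```

The choice of two clusters is only a placeholder; the elbow plot should guide the final k.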


Week 5 exercises

Data summary

human <- read.table(file = "data/human.csv", row.names = 1, sep = ",", header = TRUE)
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(tidyr)
library(corrplot)
library(ggplot2)
library(GGally)

cor_matrix <- cor(human) %>% round(2) 
cor_matrix 
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
ggpairs(human)

corrplot(cor_matrix, method="circle", type="upper")

There’s quite strong correlation between many of the variables in the data. The distributions vary, but the shapes of the histograms are often similar, for example between Mat.Mor and GNI, or between Labo.FM, Edu.Exp and Life.Exp.
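The strongest pairs can also be picked out programmatically instead of eyeballing the matrix. A small helper sketch, shown here on a toy data frame (the same call would work on cor(human)):

```r
# List the n strongest pairwise correlations in a correlation matrix
strongest <- function(m, n = 3) {
  m[lower.tri(m, diag = TRUE)] <- NA  # keep each pair only once
  ord <- order(abs(m), decreasing = TRUE, na.last = NA)[1:n]
  idx <- arrayInd(ord, dim(m))
  data.frame(var1 = rownames(m)[idx[, 1]],
             var2 = colnames(m)[idx[, 2]],
             r    = m[ord])
}

set.seed(7)
m <- cor(data.frame(a = 1:10, b = (1:10)^2, c = rnorm(10)))
strongest(m, 2)
```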

For reference, the data description from https://github.com/KimmoVehkalahti/Helsinki-Open-Data-Science/blob/master/datasets/human_meta.txt :

Health and knowledge

“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate

Empowerment

“Parli.F” = Percentage of female representatives in parliament
“Edu2.F” = Proportion of females with at least secondary education
“Edu2.M” = Proportion of males with at least secondary education
“Labo.F” = Proportion of females in the labour force
“Labo.M” = Proportion of males in the labour force

“Edu2.FM” = Edu2.F / Edu2.M
“Labo.FM” = Labo.F / Labo.M

PCA

raw data

Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human, choices = 1:2, col = c("grey40", "deeppink2"), cex = c(0.8, 1))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


standardized data

human_std <- scale(human)
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089 -0.62678110  0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770 -0.06199424 -0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270  0.07020294 -0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305  0.05834819 -0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779  0.72727675  0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539  0.25170614  0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243 -0.04986763  0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309 -0.01396293  0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
biplot(pca_human_std, choices = 1:2, col = c("grey40", "deeppink2"))

Standardizing the variables brought their contributions to the PCA much closer together. The biplot of the non-standardized PCA is almost unreadable: the only variable that stands out is GNI, which follows from the fact that its absolute values are orders of magnitude larger than those of the other variables.
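The scale effect is easy to demonstrate with a toy example: two made-up uncorrelated variables, one on a roughly 1000x larger scale. Without standardization PC1 is essentially just the large-scale variable; after standardization both contribute about equally.

```r
set.seed(1)
d <- data.frame(small = rnorm(100), big = 1000 * rnorm(100))

# Proportion of variance per component, raw vs standardized
summary(prcomp(d))$importance["Proportion of Variance", ]
summary(prcomp(d, scale. = TRUE))$importance["Proportion of Variance", ]
```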

In the standardized version, a few clear directions are discernible. Maternal mortality ratio and adolescent birth rate clearly pull in the same direction, both being related to births. Gender balance in the labour force and the percentage of female representatives in parliament also point in a common direction. The rest of the variables (GNI, life expectancy at birth, expected years of schooling and gender balance in secondary education) form a third group.

Judging from the loadings, PC1 creates an axis along which the birth-related variables (maternal mortality and adolescent birth rate) form one pole, and the group of more general quality-of-life metrics (GNI, life expectancy at birth, expected years of schooling) together with gender balance in secondary education forms the other. This makes sense, as the first group correlates strongly with a large portion of the population living in poor conditions, while high GNI and the like correlate with relatively prosperous societies.

PC2 is more connected with gender equality and does not seem to correlate directly with high GNI, though its two variables (gender balance in parliament and in the work force) are less strongly correlated with each other than the variables within the other groupings. This might make sense: a relatively conservative, low-income society could still have women employed in low-status jobs, required to work because the total income of the household would otherwise be too low. Parliamentary representation, on the other hand, correlates more with the general indicators of wealth (GNI, etc.), which would suggest that societies with more equal power relations tend also to be more prosperous.

Tea

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
ggpairs(tea[,c("age", "how", "where")])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(gather(tea), aes(value)) + 
    geom_histogram(stat="count") + 
    facet_wrap(~key, scales = 'free_x')
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`

I had a quick look at a few selected variables with ggpairs, and at histograms of all the columns.

tea_subset <- tea[,c("spirituality", "healthy", "friendliness", "relaxing", "sophisticated", "exciting", "feminine")]
tea_subset$healthy <- relevel(tea_subset$healthy, "Not.healthy") 
tea_subset$friendliness <- relevel(tea_subset$friendliness, "Not.friendliness") 
tea_subset$exciting <- relevel(tea_subset$exciting, "No.exciting") 
tea_subset$feminine <- relevel(tea_subset$feminine, "Not.feminine") 

library(FactoMineR)
mca <- MCA(tea_subset, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_subset, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.208   0.185   0.154   0.135   0.120   0.109   0.089
## % of var.             20.786  18.492  15.444  13.509  12.033  10.880   8.856
## Cumulative % of var.  20.786  39.279  54.723  68.232  80.265  91.144 100.000
## 
## Individuals (the 10 first)
##                      Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1                 |  0.996  1.591  0.653 | -0.211  0.080  0.029 | -0.082  0.014
## 2                 |  0.951  1.449  0.546 |  0.333  0.200  0.067 |  0.068  0.010
## 3                 |  0.375  0.225  0.174 | -0.600  0.649  0.446 | -0.219  0.103
## 4                 |  0.034  0.002  0.001 | -0.468  0.395  0.167 |  0.837  1.511
## 5                 |  0.286  0.131  0.062 | -0.314  0.178  0.074 |  0.539  0.628
## 6                 |  0.781  0.978  0.446 | -0.686  0.849  0.344 |  0.183  0.072
## 7                 |  0.781  0.978  0.446 | -0.686  0.849  0.344 |  0.183  0.072
## 8                 | -0.458  0.336  0.360 | -0.384  0.266  0.253 | -0.400  0.345
## 9                 |  0.545  0.477  0.223 | -0.284  0.146  0.061 |  0.464  0.465
## 10                |  0.375  0.225  0.174 | -0.600  0.649  0.446 | -0.219  0.103
##                     cos2  
## 1                  0.004 |
## 2                  0.003 |
## 3                  0.059 |
## 4                  0.534 |
## 5                  0.219 |
## 6                  0.024 |
## 7                  0.024 |
## 8                  0.274 |
## 9                  0.161 |
## 10                 0.059 |
## 
## Categories (the 10 first)
##                       Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## Not.spirituality  |   0.300   4.245   0.197   7.677 |  -0.048   0.123   0.005
## spirituality      |  -0.657   9.303   0.197  -7.677 |   0.105   0.269   0.005
## Not.healthy       |   0.472   4.588   0.095   5.340 |   0.495   5.668   0.105
## healthy           |  -0.202   1.966   0.095  -5.340 |  -0.212   2.429   0.105
## Not.friendliness  |   1.045  14.518   0.262   8.849 |  -0.210   0.658   0.011
## friendliness      |  -0.251   3.480   0.262  -8.849 |   0.050   0.158   0.011
## No.relaxing       |   0.428   4.731   0.110   5.747 |   0.892  23.133   0.480
## relaxing          |  -0.258   2.859   0.110  -5.747 |  -0.539  13.979   0.480
## Not.sophisticated |   1.022  20.330   0.413  11.109 |  -0.361   2.855   0.052
## sophisticated     |  -0.404   8.037   0.413 -11.109 |   0.143   1.129   0.052
##                    v.test     Dim.3     ctr    cos2  v.test  
## Not.spirituality   -1.231 |  -0.487  15.079   0.520 -12.472 |
## spirituality        1.231 |   1.068  33.046   0.520  12.472 |
## Not.healthy         5.598 |   0.371   3.813   0.059   4.196 |
## healthy            -5.598 |  -0.159   1.634   0.059  -4.196 |
## Not.friendliness   -1.777 |   0.890  14.168   0.190   7.535 |
## friendliness        1.777 |  -0.213   3.396   0.190  -7.535 |
## No.relaxing        11.985 |  -0.453   7.160   0.124  -6.093 |
## relaxing          -11.985 |   0.274   4.326   0.124   6.093 |
## Not.sophisticated  -3.926 |  -0.175   0.806   0.012  -1.906 |
## sophisticated       3.926 |   0.069   0.319   0.012   1.906 |
## 
## Categorical variables (eta2)
##                     Dim.1 Dim.2 Dim.3  
## spirituality      | 0.197 0.005 0.520 |
## healthy           | 0.095 0.105 0.059 |
## friendliness      | 0.262 0.011 0.190 |
## relaxing          | 0.110 0.480 0.124 |
## sophisticated     | 0.413 0.052 0.012 |
## exciting          | 0.005 0.637 0.040 |
## feminine          | 0.373 0.005 0.135 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic")

Looks like Dim 1 reveals a tendency among the respondents either to assign multiple attributes to tea or to assign none. Dim 2 shows that ‘exciting’ pairs with ‘not relaxing’, and vice versa. The rest of the attributes have weaker associations, but there are certain groupings: sophistication, femininity and spirituality cluster close to each other, as do non-friendliness and unsophistication.


Week 6 exercises

Couldn’t finish these in time. Apologies to whoever is checking this.

RATS <- read.table(file = "data/rats.csv", sep = ",", header = TRUE)
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

BPRS <- read.table(file = "data/bprs.csv", sep = ",", header = TRUE)
BPRS$treatment <- factor(BPRS$treatment) 
BPRS$subject <- factor(BPRS$subject) 

Overview of RATS

library(ggplot2)

# Draw the plot
ggplot(RATS, aes(x = Time, y = weight, color = ID, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS$weight), max(RATS$weight)))

It might have made more sense to plot them all in one graph, with linetype = Group instead of ID.

library(dplyr)
library(tidyr)

RATS <- RATS %>%
  group_by(Time) %>%
  mutate(stdweight = (weight - mean(weight))/ sd(weight)) %>%
  ungroup()

# Plot again with the standardized
ggplot(RATS, aes(x = Time, y = stdweight, color = ID, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:16, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

Standardized version.

Summary of rat weights per week

n <- 16

# Summary data with mean and standard error of weight by treatment and week 
RATSS <- RATS %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(weight), se = sd(weight)/sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

And mean weights:

RATS8S <- RATS %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
library(ggplot2)
ggplot(RATS8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight)")

Had to filter outliers from each Group of rats separately.

RATS8S_filtered <- RATS8S %>% filter(!(Group == 1 & mean < 250) &
                                     !(Group == 2 & mean > 550) &
                                     !(Group == 3 & mean < 500))

ggplot(RATS8S_filtered, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight)")

… and the groupings are very tight as a result of the filtering. I wonder if it really made sense.

T-test, Anova

A t-test only makes sense pairwise, so we’ll filter the data down to two groups at a time. We could run a t-test of group 1 vs group 3 too, but that would show even more obvious significance than these tests.

RATS12 <- RATS8S_filtered %>% filter(Group == 1 | Group == 2)
t.test(mean ~ Group, data = RATS12, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -41.656, df = 8, p-value = 1.215e-10
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -192.089 -171.937
## sample estimates:
## mean in group 1 mean in group 2 
##        267.4416        449.4545
RATS23 <- RATS8S_filtered %>% filter(Group == 2 | Group == 3)
t.test(mean ~ Group, data = RATS23, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -18.428, df = 4, p-value = 5.102e-05
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -100.45660  -74.14946
## sample estimates:
## mean in group 2 mean in group 3 
##        449.4545        536.7576
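Instead of filtering to two groups at a time, base R’s pairwise.t.test() runs all the pairwise comparisons at once and adjusts the p-values for multiple testing. A toy sketch with made-up numbers roughly mirroring the group means above:

```r
set.seed(42)
# Fake per-rat mean weights: 8 rats in group 1, 4 each in groups 2 and 3
means <- c(rnorm(8, 267, 10), rnorm(4, 449, 10), rnorm(4, 537, 10))
group <- factor(rep(1:3, times = c(8, 4, 4)))

# Bonferroni-adjusted p-values for all three group comparisons
pairwise.t.test(means, group, p.adjust.method = "bonferroni")
```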
# Add the baseline from the original data as a new variable to the summary data
RATS_bl <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')


RATS8S_non_filtered <- RATS8S %>% mutate(baseline = RATS_bl$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATS8S_non_filtered)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like the group differences were real, as was evident from the graphs too, but they are practically all explained by the baseline weight. After controlling for the baseline, the p-value of 0.076 for Group barely registers as significant at all.